THE USE OF THE SEQUENTIAL PROBABILITY RATIO TEST
IN  MAKING  GRADE  CLASSIFICATIONS  IN  CONJUNCTION 
WITH  TAILORED  TESTING 


In  many  testing  applications,  the  major  use  of  the  obtained  score  is  to 
classify  a  person  as  being  above  or  below  some  criterion  score.  Examples  of 
such  uses  of  test  results  include  the  screening  of  job  applicants  and  the 
classification  of  students  as  masters  and  non-masters  when  using  the  mastery 
learning  paradigm  (Bloom,  1971).  For  such  applications  it  is  not  necessarily 
required  that  the  person's  ability  be  accurately  estimated,  but  only  that  the 
measurements  be  sufficiently  precise  that  the  examinees  can  be  accurately 
classified.

When  making  such  classifications,  the  accuracy  of  measurement  required 
in  making  the  decision  is  dependent  upon  how  far  from  the  cutting  score  the 
person  is  located.  If  the  examinee  is  far  above  or  below  the  cutting  score, 
minimal  accuracy  will  be  required.  If  the  examinee  is  close  to  the  cutting 
score,  high  precision  will  be  required.  Since  the  accuracy  of  an  ability 
estimate depends to a large extent on test length, it follows that shorter
tests can be used when a person's ability is a substantial distance from the
cutting score. Depending on the number of individuals who are far from the
cutting score, the average test length needed for classification might be
substantially reduced from what is commonly used.

Based on this analysis, an optimal procedure for testing examinees for
classification purposes would be to check the accuracy of classification
after each item is administered. If the accuracy were sufficiently high,
testing could stop. If the accuracy were not high enough, another item would
be administered.

Exactly  this  type  of  procedure  was  developed  by  Wald  (1947)  to  assist  in 
quality control work during World War II. His procedure was designed to
determine whether a batch of parts was acceptable based on whether it contained
a sufficiently low number of defectives. The basic concept behind the
procedure is to take an observation from the batch and determine the probability
of the observation under the hypothesis of an acceptable or unacceptable batch.
A ratio is formed by dividing the probability of the observation coming from
an acceptable batch by the probability of it coming from an unacceptable batch.
If  the  ratio  is  sufficiently  large,  the  batch  is  considered  acceptable  and  if 
it  is  sufficiently  small,  the  batch  is  considered  unacceptable.  If  the  ratio 
is  near  1.0,  another  observation  is  randomly  selected.  A  new  ratio  is  then 
formed  using  all  of  the  previous  observations.  The  process  continues  until  a 
decision  is  reached.  Because  of  the  sequential  nature  of  the  process,  it  has 
been  labeled  the  Sequential  Probability  Ratio  Test  (SPRT). 

Since  its  development,  the  SPRT  has  been  widely  used  for  quality  control 
work (Govindarajulu, 1975). However, only recently has it appeared in the
mental testing literature. Ferguson (1970) used the SPRT procedure to
determine whether 75 students had mastered material in a hierarchically arranged
set of instructional units. His procedure randomly generated items by computer
using item forms and then administered the items using a computer terminal.
He found a substantial reduction in testing time and in the number of items
required to make a decision. The procedure was found to be in 99% agreement
with the longer tests traditionally used to make the decisions.

No  other  studies  were  found  that  actually  made  real  time  decisions  using 
the  SPRT  procedure.  However,  Epstein  &  Knerr  (1978)  did  present  the  results 
of  a  real  data  simulation  using  Army  proficiency  testing  response  data.  They 
found  that  only  33%  as  many  items  were  needed  for  the  SPRT  based  procedure 
without  loss  in  decision  accuracy.  Sixtl  (1974),  Kalish  (1980),  and  Kingsbury 
and  Weiss  (1980)  present  the  results  of  simulation  studies  showing  that  the 
SPRT procedures result in a substantial reduction in the number of items
required to make decisions. Thus, all the research to date supports the
contention that SPRT based procedures lead to increased testing efficiency.

Despite  the  promising  results  reported  in  the  studies  listed  above,  none 
of  the  procedures  described  take  full  advantage  of  the  quality  items  in  the 
item  pool.  That  is,  by  randomly  selecting  items,  the  best  items  for  making 
the  classification  decision  may  not  be  administered.  A  better  procedure  would 
be  to  select  the  items  from  the  item  pool  that  would  be  most  informative  for 
making  the  decision  using  a  tailored  testing  paradigm.  Reckase  (1978)  has 
shown that such a procedure could be used with the SPRT as long as local
independence could be assumed. In a series of simulation studies (Reckase, 1980a,
1980b),  he  demonstrated  that  SPRT  procedures  will  work  with  tailored  testing. 
Further,  a  three-parameter  logistic  based  procedure  was  found  to  give  better 
results  than  a  one-parameter  logistic  based  procedure. 

With the positive results obtained to date, it seems prudent to evaluate
the quality of SPRT/tailored testing procedures for actual decisions. The
purpose of this report is to present some results of the operation of the
SPRT/tailored testing hybrid in the context of grade classification. Further,
one-parameter and three-parameter logistic model based procedures will be compared
on  the  basis  of  decision  consistency.  The  overall  criterion  for  success  will 
be  a  comparison  with  traditional  grading  procedures. 

The  SPRT  Procedure 


The  SPRT  procedure  has  been  described  in  detail  elsewhere  (Wald,  1947; 
Epstein  &  Knerr,  1978;  Reckase,  1980a)  so  only  a  brief  description  will  be  given 
here. The basic equations will be presented along with the procedures for
describing the characteristics of the decision making process.

As described above, the basic philosophy behind the SPRT procedure is to
determine the probability of the observed responses for two alternative
hypotheses and then form the ratio of the probabilities. A large ratio favors one
of the hypotheses and a small ratio favors the other. For example, if $H_1$ is
the hypothesis that the ability ($\theta$) for a person is equal to $\theta_1$,
and $H_2$ is the hypothesis that the ability equals $\theta_2$, the probability
of the obtained responses, $x_1, x_2, \ldots, x_n$, given these hypotheses would be:

$$P(x_1, x_2, \ldots, x_n \mid \theta_1) = \prod_{i=1}^{n} P(x_i \mid \theta_1) \qquad (1)$$

$$P(x_1, x_2, \ldots, x_n \mid \theta_2) = \prod_{i=1}^{n} P(x_i \mid \theta_2) \qquad (2)$$

under the local independence assumption of latent trait theory. The values
of $P(x_i \mid \theta_j)$ would be computed using the appropriate latent trait model
assuming known item parameters from a previous item calibration. Assuming
$\theta_1 < \theta_2$, the probability ratio would then be formed as

$$\lambda = \frac{P(x_1, x_2, \ldots, x_n \mid \theta_1)}{P(x_1, x_2, \ldots, x_n \mid \theta_2)}. \qquad (3)$$

If this ratio were sufficiently large $H_2$ would be rejected, and if the ratio
were sufficiently small $H_1$ would be rejected. The determination of what
constitutes large and small depends upon the error rates that are considered
acceptable.

Suppose $\alpha$ is the probability of accepting $H_1$ when $H_2$ is really true
and $\beta$ is the probability of accepting $H_2$ when $H_1$ is really true. Wald
(1947) has shown that a good approximation to the decision points needed for the
probability ratio (Equation 3) can be obtained by the following two expressions:

$$\text{Upper decision point} = A = \frac{1 - \beta}{\alpha} \qquad (4)$$

and

$$\text{Lower decision point} = B = \frac{\beta}{1 - \alpha} \qquad (5)$$

Thus, if Equation 3 gives a result larger than $A$, $H_1$ should be accepted with
an error rate of approximately $\alpha$, and if the expression yields a value less
than $B$, $H_2$ should be accepted with an error rate of approximately $\beta$.
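The decision rule above can be sketched in a few lines of Python. This is only an illustration, assuming the standard Wald approximations A = (1 - beta)/alpha and B = beta/(1 - alpha); the function names are hypothetical and do not come from the original report.

```python
def wald_bounds(alpha, beta):
    """Return Wald's approximate (A, B) decision points.

    alpha: probability of accepting H1 when H2 is really true.
    beta:  probability of accepting H2 when H1 is really true.
    """
    upper = (1.0 - beta) / alpha
    lower = beta / (1.0 - alpha)
    return upper, lower


def sprt_decision(ratio, upper, lower):
    """Compare one probability ratio (Equation 3) with the decision points."""
    if ratio > upper:
        return "accept H1"
    if ratio < lower:
        return "accept H2"
    return "continue"  # ratio is between B and A: take another observation


# Error rates of .02 and .10 are used here purely as example inputs.
A, B = wald_bounds(alpha=0.02, beta=0.10)
```

A ratio that falls between the two decision points leads to another observation, so the number of observations is itself a random quantity.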

The procedure described above assumes that a decision is to be made between
two simple hypotheses: $H_1\colon \theta = \theta_1$ or $H_2\colon \theta = \theta_2$.
Wald (1947) has generalized this procedure to making decisions concerning complex
hypotheses such as $H_0\colon \theta < \theta_c$ and $H_1\colon \theta > \theta_c$.
This is a much more useful set of hypotheses because it matches the
decision process used in making classifications above or below a criterion score.

In order to test a complex hypothesis using the SPRT, an indifference region
must first be specified around the cutting score, $\theta_c$, for the decision. The
indifference region is the area around the cutting score in which either
classification is considered equally good. For example, if $\theta_c$ is the cutting
score for making the decision, persons sufficiently close to $\theta_c$ could be
classified either high or low without appreciable loss. Sufficiently close is
defined here as being between $\theta_1$ and $\theta_2$ when
$\theta_1 > \theta_c > \theta_2$. If a person were outside the region from
$\theta_1$ to $\theta_2$ and were misclassified, the error would be considered serious.

The  use  of  the  SPRT  to  test  complex  hypotheses  works  the  same  as  for  the 
simple  hypotheses  except  that  the  limits  of  the  indifference  region  are  used  in 
Equation  3  to  form  the  probability  ratio  instead  of  the  hypothesized  true  values. 
The  upper  and  lower  decision  points  for  the  test  are  determined  in  exactly  the 
same way as before (Equations 4 and 5). However, now the operation of the SPRT
is controlled not only by the $\alpha$ and $\beta$ error rates, but also by the width
of the indifference region. The higher the error rates and the wider the
indifference region, the fewer the items that need to be administered.
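As a minimal sketch of how the probability ratio might be evaluated at the limits of the indifference region, the following Python function multiplies item probabilities under local independence. The three-parameter logistic form and the scaling constant D = 1.7 are assumptions made for illustration (the text does not state the scaling), and the function names are hypothetical.

```python
import math

D = 1.7  # scaling constant; an assumption, not stated in the text


def p_correct(theta, a, b, c):
    """3PL probability of a correct response (reduces to 1PL when a=1, c=0)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def probability_ratio(responses, items, theta_1, theta_2):
    """Equation 3 evaluated at the indifference-region limits.

    responses: list of 0/1 item scores.
    items:     list of (a, b, c) parameter tuples, one per response.
    theta_1, theta_2: upper and lower limits of the indifference region.
    """
    log_ratio = 0.0  # work on the log scale to avoid underflow on long tests
    for x, (a, b, c) in zip(responses, items):
        p1 = p_correct(theta_1, a, b, c)
        p2 = p_correct(theta_2, a, b, c)
        if x == 1:
            log_ratio += math.log(p1 / p2)
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p2))
    return math.exp(log_ratio)
```

With the higher limit in the numerator, correct responses push the ratio above 1.0 and incorrect responses push it below 1.0.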

The quality of operation of the SPRT procedure is usually judged on the
basis of two mathematical functions called the operating characteristic (OC)
function and the average sample number (ASN) function. The OC function is
defined as

$$OC(\theta) = P(\text{classified below } \theta_c \mid \theta).$$

This function should have values close to 1.0 for $\theta < \theta_c$ and values
close to 0.0 for $\theta > \theta_c$. To the extent that this function drops quickly
from a value near 1.0 to near 0.0 in the indifference region, the SPRT procedure
is working well.

The ASN function is defined as the average number of observations needed
to make a decision as a function of $\theta$. This function is typically peaked, with
high  values  near  the  cutting  score  and  decreasing  values  with  increased  distance 
from  the  cutting  score.  Both  the  OC  function  and  the  ASN  function  are  dependent 
on  the  size  of  the  error  rates  and  the  width  of  the  indifference  region.  A 
narrow  indifference  region  and/or  low  error  rates  result  in  a  steep  OC  function 
and  require  a  large  number  of  observations  for  decisions.  High  error  rates  and/ 
or  a  wide  indifference  region  flatten  the  OC  function  and  reduce  the  number  of 
observations  required.  Thus,  the  price  paid  for  high  precision  is  a  greater 
number  of  observations.  More  detailed  information  concerning  the  OC  and  ASN 
functions  can  be  found  in  Wald  (1947),  Reckase  (1980a),  or  Epstein  and  Knerr 
(1978). 
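The shape of the OC and ASN functions can be illustrated with a small Monte Carlo sketch. The setup is deliberately simplified and is an assumption, not the design of this study: every item is a one-parameter logistic item located at a cutting score of 0, the indifference region runs from -.45 to .45, and testing is capped at 500 items. All names are hypothetical.

```python
import math
import random


def simulate_sprt(theta, theta_1, theta_2, upper, lower, rng, max_items=500):
    """Run one SPRT on identical 1PL items with difficulty b = 0.

    Returns (classified_low, items_used).
    """
    log_upper, log_lower = math.log(upper), math.log(lower)
    p_true = 1.0 / (1.0 + math.exp(-theta))   # examinee's true success rate
    p1 = 1.0 / (1.0 + math.exp(-theta_1))     # success rate at upper limit
    p2 = 1.0 / (1.0 + math.exp(-theta_2))     # success rate at lower limit
    log_ratio = 0.0
    for n in range(1, max_items + 1):
        if rng.random() < p_true:
            log_ratio += math.log(p1 / p2)
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p2))
        if log_ratio > log_upper:
            return False, n
        if log_ratio < log_lower:
            return True, n
    return log_ratio < 0.0, max_items  # forced decision at the cap


def oc_and_asn(theta, reps=2000, seed=0):
    """Estimate OC(theta) and ASN(theta) by simulation."""
    rng = random.Random(seed)
    runs = [simulate_sprt(theta, 0.45, -0.45, 45.0, 0.102, rng)
            for _ in range(reps)]
    oc = sum(low for low, _ in runs) / reps
    asn = sum(n for _, n in runs) / reps
    return oc, asn
```

Calling `oc_and_asn` at abilities well below, at, and well above the cutting score should show the pattern described above: an OC value near 1.0 falling to near 0.0, and an ASN that peaks near the cutting score.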

Tailored  Testing  Procedure 

Tailored testing procedures are defined by their methods of item selection
and ability estimation. The procedure used in this study selects items to
maximize the value of the information function (Birnbaum, 1968) at the previous
ability estimate. Ability was estimated using an empirical maximum likelihood
approach. The procedure is described in detail by McKinley & Reckase (1980), so
it will not be described again here. The above tailored testing procedure was
used with both the one-parameter logistic (1PL) and the three-parameter logistic
(3PL) models in the study reported here.
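Maximum information item selection can be sketched as follows. The item information expression is the standard one for the three-parameter logistic model; the scaling constant D = 1.7, the representation of the pool as (a, b, c) tuples, and the function names are assumptions for illustration rather than details of the procedure in McKinley & Reckase (1980).

```python
import math

D = 1.7  # assumed scaling constant


def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information for the 3PL model at ability theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta_hat, pool, administered):
    """Index of the unused item with maximum information at theta_hat.

    pool:         list of (a, b, c) tuples.
    administered: set of indices already given to the examinee.
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))
```

When all items share the same discrimination and have no guessing parameter, this rule reduces to picking the item whose difficulty is closest to the current ability estimate.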

Tailored  Testing/SPRT  Hybrid 

The procedure used to administer the test items in this study used
components of both tailored testing methodology and the SPRT. Items to be
administered in the process of the computerized test were selected using the maximum
information criterion (Birnbaum, 1968; McKinley & Reckase, 1980). After the
response to each item was obtained, the value of the probability ratio (Equation
3) was computed and a decision was made to classify high, classify low, or to
administer another item. If another item were to be administered, a maximum
likelihood ability estimate was obtained and a new item was selected to maximize
the information function at that ability estimate and administered to the
examinee. The process continued until a classification decision had been made or
until  20  items  had  been  administered.  After  20  items,  ratios  above  1.0  resulted 
in  a  high  classification,  and  ratios  below  1.0  resulted  in  low  classification. 
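The flow of the hybrid procedure for a single cutting score can be sketched as a loop. Several parts of this sketch are assumptions and not the authors' implementation: ability is estimated by a bounded grid search over the likelihood (a stand-in for the empirical maximum likelihood approach), the starting ability is 0.0, the scaling constant is D = 1.7, and `answer` is a hypothetical callback returning the 0/1 response to an item.

```python
import math

D = 1.7  # assumed scaling constant


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def information(theta, a, b, c):
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def log_likelihood(theta, responses, items):
    total = 0.0
    for x, (a, b, c) in zip(responses, items):
        p = p_correct(theta, a, b, c)
        total += math.log(p if x == 1 else 1.0 - p)
    return total


GRID = [g / 20.0 for g in range(-80, 81)]  # theta from -4 to 4 in steps of .05


def hybrid_test(pool, answer, theta_1, theta_2, upper, lower,
                start_theta=0.0, max_items=20):
    """Return (classification, items_used) for one cutting score."""
    used, responses = [], []
    theta_hat, ratio = start_theta, 1.0
    while len(used) < max_items and len(used) < len(pool):
        # Select the unused item with maximum information at theta_hat.
        remaining = [i for i in range(len(pool)) if i not in used]
        nxt = max(remaining, key=lambda i: information(theta_hat, *pool[i]))
        used.append(nxt)
        responses.append(answer(nxt))
        items = [pool[i] for i in used]
        # Probability ratio at the indifference-region limits (Equation 3).
        ratio = math.exp(log_likelihood(theta_1, responses, items)
                         - log_likelihood(theta_2, responses, items))
        if ratio > upper:
            return "high", len(used)
        if ratio < lower:
            return "low", len(used)
        # Bounded grid-search estimate, defined even for perfect scores.
        theta_hat = max(GRID, key=lambda t: log_likelihood(t, responses, items))
    # Forced decision at the item limit; a ratio of exactly 1.0 is treated
    # as "low" here, which the text does not specify.
    return ("high" if ratio > 1.0 else "low"), len(used)
```

In this sketch the 20-item limit plays the same role as in the study: when no decision point has been crossed, the side of 1.0 on which the ratio falls determines the classification.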

Research Design

The  purpose  of  the  research  reported  here  was  to  compare  1PL  and  3PL  based 
procedures  for  making  classification  decisions  using  the  SPRT.  Since  the  true 
classifications were unknown, consistency of classification was used
as the criterion for evaluation. To facilitate the comparison of decision
consistency, a test-retest design was used in which tailored tests based on both
the 1PL and 3PL models were administered to the same individuals in two sessions
one week apart. In the first session the 1PL and 3PL tailored tests were
administered as described above without a break in between. From the student's
point of view, only one test was administered. In the second session, the same
procedure was followed, except that the order of presentation of the 1PL and 3PL
procedures was reversed to counterbalance fatigue effects. The initial order of
presentation of the 1PL and 3PL procedures was randomly assigned to the students.

Within  the  tailored  tests,  three  grade  placement  decisions  were  made  using 
the  SPRT  procedure.  Based  on  the  test  information,  students  were  placed  above 
or  below  the  A/B  grade  cutoff,  the  B/C  grade  cutoff,  and  the  C/D  grade  cutoff. 
Thus, if a student were classified below the A/B cutoff and above the B/C
cutoff, a grade of B would be assigned. The grade cutoffs for the study were set
to be consistent with those used on the traditional test using the test
characteristic curve.
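The way three above/below decisions combine into a letter grade can be sketched as below. The handling of an inconsistent set of decisions (for example, above A/B but below B/C) is an assumption of this sketch; the text does not say how such cases were resolved.

```python
def assign_grade(above_ab, above_bc, above_cd):
    """Combine three SPRT decisions into a letter grade.

    Each argument is True when the student was placed above that cutoff.
    Cutoffs are checked from the highest down, so an inconsistent set of
    decisions resolves to the higher grade (an assumption of this sketch).
    """
    if above_ab:
        return "A"
    if above_bc:
        return "B"
    if above_cd:
        return "C"
    return "D"
```

For example, a student placed below the A/B cutoff but above the B/C and C/D cutoffs receives a B.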

Before  the  cutoffs  could  be  set,  the  traditional  test  first  had  to  be  linked 
to  the  tailored  testing  item  pool.  This  was  done  so  that  the  cutoffs  determined 
from  the  traditional  test  would  be  on  the  same  scale  as  the  tailored  test  ability 
estimates.  The  linking  was  performed  using  the  major  axis  method  for  the  1PL 
model,  and  the  maximum  likelihood  method  for  the  3PL  model.  See  Reckase  (1979a) 
for  a  more  detailed  description  of  these  procedures. 

The traditional test used as a basis for the grade cutoffs was a 50 item
multiple choice test over the area of classroom evaluation procedures. The test
and the population of students who took part in the study were from an
introductory course on educational measurement techniques. The grade classification
regions for the traditional test in terms of raw scores were: 42-50, A; 33-41, B;
29-32, C; and 28 and below, D. Based on these score ranges, the A/B cutoff was
set at 41.5, the B/C cutoff at 32.5, and the C/D cutoff at 28.5. The 1PL ability
scale cutoffs corresponding to the raw score cutoffs were A/B, 2.24; B/C, .95;
and C/D, .46. The cutoffs on the 3PL ability scale were: A/B, .78; B/C, -.85;
and C/D, -1.39. These values were determined by finding the points on the latent
trait scales that were equivalent to the raw score points.
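Finding the ability value equivalent to a raw-score cutoff amounts to inverting the test characteristic curve. The sketch below does this by bisection. The item parameters are invented for illustration (the actual parameters of the 50-item test are not given here), D = 1.7 is an assumed scaling constant, and the half-point raw cutoffs are read from the score ranges in the text.

```python
import math

D = 1.7  # assumed scaling constant


def expected_raw_score(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
               for a, b, c in items)


def theta_for_raw_score(raw, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Bisection search for the theta whose expected raw score equals raw.

    Relies on the curve being increasing in theta and on raw lying between
    the expected scores at lo and hi.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, items) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical 50-item test; these (a, b, c) values are invented.
items = [(1.0, -2.0 + 0.06 * i, 0.2) for i in range(50)]
cutoffs = {name: theta_for_raw_score(raw, items)
           for name, raw in [("A/B", 41.5), ("B/C", 32.5), ("C/D", 28.5)]}
```

Because the invented parameters differ from those of the real test, the resulting ability cutoffs will not match the values reported above; only the method is illustrated.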

Along with the cutting points, an indifference region and the $\alpha$ and $\beta$
error rates were needed to totally specify the SPRT procedure. A reasonable
indifference region for the test was thought to be one standard error of
measurement on either side of the cutting point. Based on the traditional test
reliability of .60 for the sample of students used in the study, the standard error
of measurement in 1PL and 3PL ability units was .45. Thus, the indifference regions
were set at A/B, 2.69 to 1.79; B/C, 1.40 to .50; and C/D, .91 to .01 for the 1PL
procedure and A/B, .33 to 1.23; B/C, -1.30 to -.40; and C/D, -1.84 to -.94 for the
3PL procedure. The differences in indifference regions for the two procedures
were due to differences in the way the origins of the ability scales were defined.
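The indifference regions follow directly from the ability-scale cutoffs and the standard error of measurement. A minimal sketch, using the cutoffs and the value of .45 given in the text (the dictionary layout and function name are hypothetical):

```python
SEM = 0.45  # standard error of measurement in ability units, from the text

CUTOFFS = {
    "1PL": {"A/B": 2.24, "B/C": 0.95, "C/D": 0.46},
    "3PL": {"A/B": 0.78, "B/C": -0.85, "C/D": -1.39},
}


def indifference_region(cutoff, half_width=SEM):
    """Return (lower, upper) limits one SEM on either side of the cutoff."""
    return round(cutoff - half_width, 2), round(cutoff + half_width, 2)


regions = {model: {name: indifference_region(c) for name, c in cuts.items()}
           for model, cuts in CUTOFFS.items()}
```

Each region is .90 ability units wide regardless of the model, so the two procedures differ only in where the regions sit on their respective scales.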


Since it was considered a more serious error to classify someone high
incorrectly than low incorrectly, $\alpha$ was set at .02 and $\beta$ was set at .10.
Using Equations 4 and 5, the decision points for the SPRT were computed to be A = 45
and B = .102. This resulted in a classification in the higher grade category if
Equation 3 resulted in a value greater than 45, in the lower grade category if
the value was below .102, and continued testing if the result was between .102
and 45. The same A and B values were used for both the 1PL and 3PL procedures.
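As a check on the arithmetic, and assuming Equations 4 and 5 take the standard Wald form, the stated error rates reproduce the reported decision points:

```latex
A = \frac{1 - \beta}{\alpha} = \frac{1 - .10}{.02} = 45,
\qquad
B = \frac{\beta}{1 - \alpha} = \frac{.10}{1 - .02} \approx .102
```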

The  sample  used  in  this  study  consisted  of  88  student  volunteers  from  an 
undergraduate  introductory  measurement  course.  Of  the  88  students,  21  were  male 
and  67  female.  The  group  consisted  of  19  juniors,  67  seniors,  and  2  graduate 
students.  The  tailored  tests  were  administered  the  week  following  a  classroom 
test  over  the  same  content.  The  examinees  were  told  that  the  tailored  test  score 
would  be  substituted  for  the  classroom  test  score  if  they  performed  better  on  the 
tailored  test,  and  that  they  would  receive  extra  credit  points  for  completing  the 
requirements  of  the  study. 


Analyses 

The  major  analysis  performed  in  this  study  was  the  comparison  of  the  grade 
classifications  over  the  test-retest  period.  This  analysis  was  to  show  which 
procedure  (1PL  or  3PL)  gave  more  consistent  grade  classification  over  the  one 
week  time  period.  Since  the  grade  scale  yields  mainly  categorical  results,  a 
phi  coefficient  derived  from  the  chi-square  contingency  table  was  used  for  this 
analysis. The same analysis was also performed to determine which procedure
made grade classifications that were more similar to those obtained from a
traditional classroom test.
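One common way to derive a phi coefficient from a contingency table is to take the square root of chi-square divided by the number of cases. The sketch below assumes that this is the computation meant here, which the text does not state explicitly; the function name is hypothetical.

```python
import math


def phi_coefficient(table):
    """Phi computed as sqrt(chi-square / N) from a contingency table.

    table: list of rows of observed counts. For tables larger than 2 x 2
    this statistic is not bounded above by 1.0.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows or columns
                chi_square += (observed - expected) ** 2 / expected
    return math.sqrt(chi_square / n)
```

For a test-retest comparison, the rows would hold the grades from the first session and the columns the grades from the second, with each cell counting the students who received that pair of grades.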

Along  with  the  above  analyses,  the  distributions  of  grades  for  the  two 
procedures were determined and compared. The number of items required for a
decision was also tabulated for each procedure, and the mean numbers of items
required were compared using a two-way ANOVA. Session and procedure were the
independent variables in this analysis, with repeated measures over both
session and procedure.


Results


The  direct  result  of  the  tailored  testing  procedure  in  this  study  is  the 
classification  of  students  into  grade  categories  using  the  SPRT  paradigm.  The 
results of this grade classification for the 1PL and 3PL tailored testing
procedures and the traditional classroom test are shown in Table 1. This table
presents the frequency distribution of the grades for each procedure and each
testing session. The means and standard deviations are also presented to
summarize the distributions even though the data are only ordinal.

From these results, a tendency can be seen for the 1PL procedure to grade
slightly easier than the 3PL procedure. The traditional test assigned the
highest average grade of all the procedures. This can probably be explained by
the fact that the classroom test was the test studied for and it was taken first.
The  standard  deviations  of  grades  for  the  1PL  and  3PL  procedures  were  about  the 
same,  with  a  slight  increase  in  the  second  testing  session.  The  traditional 
test  had  the  smallest  standard  deviation  of  all  of  the  procedures. 

Table 1

Grade Distributions for the 1PL and 3PL Tailored Tests
and the Traditional Classroom Test

                              Procedure
Session   Grade     1PL       3PL       Traditional

1         A (4)     13        6         8
          B (3)     60        58        78
          C (2)     20        26        10
          D (1)     7         10        4
          Mean      2.78      2.59      2.91
          S.D.      .75       .75       .56

2         A (4)     18        12
          B (3)     54        50
          C (2)     17        27
          D (1)     11        10
          Mean      2.78      2.65
          S.D.      .88       .83

Note: The values presented in the table are percentages of 88 cases.

The  results  of  the  consistency  of  classification  analysis  are  presented 
in  Table  2  along  with  a  comparison  with  the  grades  assigned  by  the  traditional 
classroom  exam  over  the  same  course  content  and  the  final  grade  in  the  course. 

As can be seen from this table, the consistency of the 3PL/SPRT procedure was
substantially higher than that of the 1PL/SPRT procedure (phi = .938 vs. .662;
t = 5.19, p < .01).


Table 2

Phi Coefficients Showing the Consistency
of Grade Classifications and the Relationship
With Traditional Grading Practices

                              Test
                                          Course    Final
Test      1PL-1   1PL-2   3PL-1   3PL-2   Exam      Grade

1PL-1             .662    .340    .489    .486      .679
1PL-2                     .448    .645    .495      .710
3PL-1                             .938    .376      .461
3PL-2                                     .490      .649

Note: All phi coefficients are based on 88 cases.

The relationship between the tailored testing results and the traditional
grading schemes shows a more confusing pattern. The 1PL procedure had a
correlation of around .5 with the exam grades and about .7 with the final grades.
This was unexpected because the course exam was on the same material as the
tailored test, while the final grade was based on a composite of three exams
over different content areas. The correlations of the 3PL procedure with the
course grade gave a similar pattern of results, but the grades assigned by
the first 3PL session had lower phi coefficients. The results from the second
testing were about the same magnitude as the 1PL results.

The data on the mean number of test items required to make the grade
classifications are presented in Table 3. Since the tailored testing
procedures were terminated if a grade decision was not made at or before 20 items,
the table also gives the percent of cases making classifications in 20 items
or less. As can be seen from this table, the 1PL procedure seldom was able to
make classification decisions in 20 items or less, while about half the time
the 3PL procedure could. Overall, the 3PL procedure required significantly
fewer items to make a decision than the 1PL procedure ($\bar{x} = 13.41$ vs. 18.14).
Significantly fewer items were also required for the second testing session.
The ANOVA on the number of items required for classification is given in
Table 4. The low number of items required for a grade classification is even
more dramatic when compared to the 50 items used to make the grade
classifications with the traditional test.

Table 3

Average Number of Items Required
To Make Grade Classifications
by Procedure and Session

                                         1PL                    3PL
                                 Session 1  Session 2   Session 1  Session 2

Percent using 20 items or less      5.70       6.80       50.00      53.40
Mean for cases using
  20 items or less                 11.20      14.50        9.02    (illegible)
Mean for all cases (N=88)          18.61      17.66       13.97      12.85
S.D. for all cases                  2.85       4.00        4.94       5.00

Table 4

ANOVA Results on Number of Items Administered With
Model and Session as Independent Variables and
Repeated Measures on Both Variables

Source                 SS         df    MS         F        p

Model                  1966.55    1     1966.55    96.55    .00
Session                94.10      1     94.10      6.59     .01
Model x Session        .56        1     .56        .03      .85
Error (model)          1771.95    87    20.37
Error (session)        1242.40    87    14.28
Error (interaction)    1397.94    87    16.07
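The F ratios in Table 4 can be checked from the mean squares, assuming each effect is tested against its own error term, as is usual when there are repeated measures on both factors:

```latex
F_{\text{Model}} = \frac{1966.55}{20.37} \approx 96.54, \qquad
F_{\text{Session}} = \frac{94.10}{14.28} \approx 6.59, \qquad
F_{\text{Model} \times \text{Session}} = \frac{.56}{16.07} \approx .03
```

The small difference from the tabled value of 96.55 for Model presumably reflects rounding of the mean squares.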

Discussion 


The major thesis of this paper is that the number of items required to
make a decision concerning the classification of individuals above or below a
cutting score can be substantially reduced from the number traditionally used.
This can be done because abilities far removed from the cutting score need not
be measured as precisely as those near the cutting score. In order to
implement a testing procedure that can modify the length of the test as a
function of the examinee's ability, a tailored testing procedure based on maximum
information item selection and maximum likelihood ability estimation (McKinley
and Reckase, 1980) was combined with Wald's (1947) Sequential Probability Ratio
Test.


Common  wisdom  in  test  theory  indicates  that  in  order  to  accurately  classify 
individuals into two groups, the items should be selected to be most informative
at  the  cutting  score  (Lord  &  Novick,  1968).  This  could  be  done  in  this  situation 
by  selecting  items  with  maximum  information  at  the  cutting  score  and  using  the 
usual  SPRT  procedure.  However,  in  this  case  three  cutting  scores  were  present 
(A/B,  B/C,  C/D)  so  the  usual  tailored  testing  item  selection  procedure  of  choosing 
items  to  give  maximum  information  at  the  most  recent  ability  estimate  was  used. 

Beyond  demonstrating  the  economics  of  the  tailored  testing/SPRT  hybrid  over 
traditional  testing,  the  purpose  of  this  paper  was  to  compare  tailored  tests 
based  on  the  1PL  model  with  tailored  tests  based  on  the  3PL  model.  The  results 
showed that the 3PL procedure is clearly more consistent than the 1PL procedure,
but that the relationship to the grades based on the classroom tests was about
the same or a little worse for the 3PL procedure. This may be explained by the
fact that the 1PL model tends to give ability estimates that are the sum of the
components in a test while the 3PL based tests tend to give ability estimates
that are more pure measures of the first principal component of a test (see
Reckase, 1979, for a more thorough discussion). The larger correlations with
the final grades than with the exam grades are probably due to the higher
reliability of the final composite based on the sum of three exams. The generally
low correlations with the course grades were probably due to the low reliability
of the course exams (.60) and differences in method variance.

The test length analysis resulted in several interesting findings. First,
the 1PL based procedure had great difficulty in classifying students into grade
categories with fewer than 20 items. The three-parameter procedure could make
the classification with fewer than 20 items about half the time. On the average,
the 3PL procedure required about 5 fewer items for classification than the 1PL
procedure. This shorter test length with higher consistency of classification
is probably a result of the advantage obtained by using the item discrimination
parameter in item selection. Since the 1PL procedure assumes that all items are
of equal discriminating power, only the nearness of the item difficulty parameter
to the most recent ability estimate affects item selection. In selecting items
using maximum information with the 3PL procedure, discrimination, guessing, and
difficulty parameters contribute to selection. This results in the
administration of higher quality items overall. The fewer test items required in the
second session may be due to greater familiarity with the testing system
resulting in fewer mistakes in using the terminals. McKinley & Reckase (1980) give
more details concerning the characteristics of the items actually administered
in this study.


Summary  and  Conclusions 

The purpose of this paper has been to compare two tailored testing based
decision making procedures using the Sequential Probability Ratio Test. The
procedures were based on the one-parameter logistic model and the
three-parameter logistic model. The procedures were also compared to traditional paper
and pencil test based grades.

The results of the study showed that the 3PL based tailored test/SPRT
procedure had higher decision consistency and required fewer test items than the
1PL based procedure. The tailored testing/SPRT procedure also required
substantially fewer items than the traditional classroom test ($\bar{x} = 13.4$ vs. 50).
These results indicate that a substantial increase in efficiency can be obtained
through the use of tailored testing/SPRT procedures, but that the grades assigned
may not be the same as those given using a traditional method. Of the two
procedures used in this study, the 3PL based method was superior to the 1PL method
in decision consistency and number of items required. Both procedures had about
the same correlations with the traditional grades.